Bootstrapping Relation Extraction Using Parallel News Articles
Abstract
Relation extraction is the task of finding entities in text connected by semantic relations. Bootstrapping approaches to relation extraction have gained considerable attention in recent years. These approaches are built on an underlying assumption: when a pair of words is known to be related in a specific way, sentences containing those words are likely to express that relationship. Therefore, sentences containing the pair of words may be used as training data for the relation extractor. We test this assumption for various relations drawn from two domains, using parallel and non-parallel corpora of news articles. We find that the assumption holds with substantially greater probability for parallel corpora.

1 Background and Related Work

Due to the enormous variety of expression in natural language, semantic structures can surface in many ways. Finding each of these ways manually, whether by annotating text or by constructing interpretation rules by hand, is impractical. One goal of DARPA's Machine Reading Project is to minimize this human effort when moving to new domains. Semi-supervised bootstrapping methods in relation extraction are a natural choice because of their ability to benefit from a small amount of manual effort and a large, unannotated corpus. The Machine Reading Project's goals are much deeper than relation extraction alone. Our approach is component-wise, and we consider relation extraction to be one component of the machine reading task. Other components include word-to-concept mapping (including word sense disambiguation and named entity recognition) and co-reference resolution (including event co-reference and cross-document co-reference). These components are not totally separable; each one constrains and informs the others. We use Markov Logic Networks (MLNs) [Domingos et al., 2006] to perform joint inference over the outputs of the natural language processing components and the background knowledge.
Algorithm 1 gives a general outline of relation extraction bootstrapping. This algorithm bootstraps an extractor E for a single relation, although in some cases it may be beneficial to bootstrap multiple relations together [Carlson et al., 2009]. It begins with a corpus, C, and a seed set of related pairs of instances, R, for example {(Cavaliers, 109), (Suns, 91), ...} for the teamScore relation. Alternatively, we may obtain seeds from a high-precision (but low-recall) relation extractor, in which case we first run that extractor over the corpus to obtain R. The first step of the bootstrapping loop is to label all occurrences of the pairs from R in C. Then an extractor is trained to identify the labeled sentences as examples of the target relation. The process repeats, with the trained extractor providing an expanded set R.

Algorithm 1 A general relation bootstrapping algorithm
Input:
  R: A set of instance pairs for a target relation.
  C: An unlabeled corpus.
Output:
  E: The trained relation extractor.
Procedure:
  repeat
    O ← LABELOCCURRENCES(R, C)
    E ← TRAINEXTRACTOR(O)
    Rprev ← R
    R ← RUNEXTRACTOR(E, C)
  until R = Rprev
  return E

In a seminal work, Hearst [1992] demonstrated a means of finding hyponyms from text using an initial set of simple syntactic patterns and a semi-automatically discovered set of additional syntactic patterns. Many researchers have also used parallel (or comparable) corpora to find paraphrases [Barzilay and Lee, 2003]. Shinyama et al. [2002] used parallel news stories to extract dependency paths with similar meanings based on shared named entity fillers. A related idea, distant supervision [Mintz et al., 2009], uses a database of known facts to label a separate corpus of natural language. Rather than beginning with a small number of seed instances, distant supervision uses a large database such as Freebase [Bollacker et al., 2008] containing hundreds of thousands of entities. This approach operates not only on named entities but also on nominals.
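The loop above can be sketched in Python. This is a minimal toy illustration, not the paper's system: the three component functions are hypothetical stand-ins that use simple substring matching, where a real implementation would train and apply an actual extractor.

```python
def label_occurrences(pairs, corpus):
    """O: label every sentence containing both members of a known pair."""
    return [(s, pair) for s in corpus for pair in pairs
            if all(w in s for w in pair)]

def train_extractor(labeled):
    """E: toy 'extractor' that simply remembers the labeled pairs."""
    return {pair for _, pair in labeled}

def run_extractor(extractor, corpus):
    """Toy extraction: re-find known pairs (a real extractor would generalize)."""
    return {pair for pair in extractor
            if any(all(w in s for w in pair) for s in corpus)}

def bootstrap(seed_pairs, corpus):
    relation = set(seed_pairs)                          # R: seed instance pairs
    while True:
        labeled = label_occurrences(relation, corpus)   # O
        extractor = train_extractor(labeled)            # E
        prev = relation
        relation = run_extractor(extractor, corpus)     # expanded R
        if relation == prev:                            # fixed point: R = Rprev
            return extractor

# Tiny made-up corpus for the teamScore relation
corpus = ["Cavaliers beat Suns 109 to 91",
          "The Cavaliers scored 109 points"]
extractor = bootstrap({("Cavaliers", "109")}, corpus)
```

With a toy extractor that cannot generalize, the loop converges after one iteration; the interesting behavior in practice comes from RUNEXTRACTOR proposing new pairs.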
For example, the /people/person/profession relation relates a (named) person to a profession. Bunescu and Mooney [2007] used a method of multiple instance learning to train a relation extractor using a similar idea. Though there was no bootstrapping, the system relied on the fact that a large enough bag of sentences containing two named entities known to be in a relation will contain a sentence stating that relation. Their system began with sets of positive and negative pairs for each relation. Many sentences relating the positive pairs do not actually state the relation, while none of the sentences relating the negative pairs state the relation. Therefore the penalty for misclassifying positive examples was set to one ninth the penalty for misclassifying negative examples; this parameter was set by trial and error. They did not employ a parallel corpus.

Unlike previous work, we consider not just entity-entity relations but also entity-event relations such as teamInGame. In these cases the event is almost never represented by a named entity and is often not represented by a noun phrase at all. Often it is a verb that refers to an event, such as "meet", "play", or "win" for a game event. While previous work used either a parallel corpus or a non-parallel corpus, we use both with a common set of relations. In the context of pure relation extraction, extracting an instance pair for teamInGame such as (Packers, played) has little value. In isolation, all we can conclude is that the Green Bay Packers have played at least one game. However, in the context of Machine Reading, labeling two instances in a sentence as related by a specific relation can have much higher value. A sentence such as "The Packers played the Steelers" may yield the two extractions teamInGame(Packers, played) and teamInGame(Steelers, played), where the two extractions are connected by the single instance mention of "played".
Event co-reference may also enable the system to determine the date of the game referenced by "played" if it is stated in another sentence.

1.1 The Assumption of Bootstrapping

Algorithm 1 relies on the implicit assumption that when we encounter a sentence containing a pair of words known to be in a specific semantic relation, that sentence is likely to express that relation. This assumption has been at least partially validated by the success of the NLP bootstrapping methods that depend on it. The focus of this paper is narrower than bootstrapping relation extraction. We evaluate only a single function in the general bootstrapping algorithm: LABELOCCURRENCES. By determining its precision, we measure the degree to which the implicit bootstrapping assumption holds for various relations in parallel and non-parallel corpora.

2 Automatically Constructed Parallel Corpora

The Machine Reading Project motivated relation extraction on two domains: National Football League (NFL) and Intelligence Community (IC). The relations in the NFL ontology were general enough to apply to any team sport, so we will often refer to this as the Sports domain. The abundance of news articles reporting on the same event makes parallel or comparable newswire corpora an attractive resource. Dolan et al. [2004] built an aligned corpus from news articles, useful in paraphrase acquisition research. We built a large parallel corpus for both the Sports domain and the IC domain from clustered newswire. We used Google News¹ to locate and cluster articles describing the same story. Each document cluster covers a single news story, often a single event such as a game or terrorist attack. By searching Google News with domain-relevant keywords, a set of results similar to Figure 1 is retrieved. By following the link to "all 285 articles", a set of documents all describing the same event can be gathered.
Because both the relevance to the cluster and the quality (pagerank) of the articles decline with the number of search results retrieved, we limited the number of articles to the top-ranked one third, or at most 100.

[Figure 1: Example of a Google News article cluster]

Each corpus is approximately half a gigabyte of text, gathered over about four months. Each day, the corpus-gathering software downloaded the clusters for the top-ranked stories for each search term. The sports corpus was constructed by querying for news articles with the keywords "NFL", "NBA", "NCAA football", and "MLB". The IC corpus was constructed from the keywords "intelligence community", "conflict region", "Al Qaeda", "Taliban", "foreign election", and "insurgent". The sports corpus contains 145,000 documents across 3861 clusters, with an average of 37.5 documents per cluster. The IC corpus is similar, with 130,000 documents, 3114 clusters, and 41.5 documents per cluster. The news articles were automatically cleaned of the boilerplate typical of web articles and stripped of all HTML. The resulting text documents were segmented into sentences using the Stanford sentence segmenter [Klein and Manning, 2003]. Because of the way the corpus was gathered, we know only that the documents inside each cluster were considered to be about the same story by Google News. We do not know that documents in other clusters are not about the same story. Often, multiple consecutive days will have articles about the same event; given our architecture, these would appear in different clusters.

3 Experimental Setup

To determine the potential value of a parallel corpus for bootstrapping relation extractors, the experiment measures how consistently a relation holds between two seed instances within related news stories and across unrelated stories.
¹ http://news.google.com/

Table 1 shows the five relations from each domain we considered. The gloss is an English explanation of the relation, where x and y are the two related entities. The search pattern is a snippet of text we used to locate sentences in the corpus containing entities in the relationship of interest.

Table 1: Relations for filler extraction

  Relation            Gloss                                                Search Pattern
  NFL (Sports) Relations
  gameDate            The game referred to as 'x' was played on y         "on"
  gameWinner          The team, x, won the game referred to as 'y'        "won"
  gameLoser           The team, x, lost the game referred to as 'y'       "lost"
  teamInGame          The team, x, played in the game referred to as 'y'  "between"
  teamScore           The team, x, scored y points                        "scored"
  Intelligence Community Relations
  eventLocationGPE    The event, x, took place in y                       "in"
  hasCitizenship      The person, x, is a citizen of y                    "citizen of"
  injuringHumanAgent  The group, x, was responsible for the attack, y     "claimed responsibility"
  isLedBy             The organization, x, is led by y                    "headed by"

Using the search patterns, we first gathered sentences that potentially expressed the relation. From these we manually located sentences that actually expressed the target relation and identified the words that were related. We refer to the pairs of words in these sentences as the seed instances. These pairs would be the initial set R in Algorithm 1. Because all sentences in which the seed instances are syntactically related will be labeled as occurrences of the target relation in the next bootstrapping step, it is essential that the instances be suitably specific. If the seed instances are overly general, such as (group, he) for isLedBy, the LABELOCCURRENCES process will have low precision. If the seed instances are too specific, such as (The surging Los Angeles Lakers, Monday's game) for teamInGame, very few sentences will be labeled as occurrences of the relation. To balance these considerations, we adopted the following seed instance selection strategy.
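The search patterns only locate candidate sentences; the seed instances themselves are then confirmed by hand. A minimal sketch of the candidate-gathering step, assuming hypothetical regex renderings of two of the Table 1 patterns (the paper's actual matching procedure is not specified at this level of detail):

```python
import re

# Hypothetical regex versions of two search patterns from Table 1.
SEARCH_PATTERNS = {
    "teamScore": r"(\w+) scored (\d+)",
    "gameWinner": r"(\w+) won",
}

def candidate_sentences(corpus, relation):
    """Return (sentence, candidate pair) for every pattern hit.

    Hits are only candidates: a human must verify that each sentence
    actually expresses the target relation before the pair becomes a seed.
    """
    pat = re.compile(SEARCH_PATTERNS[relation])
    return [(s, m.groups()) for s in corpus
            for m in [pat.search(s)] if m]

corpus = ["The Suns scored 91 points in the loss.",
          "The judge scored 10 rounds for Alvarez."]  # false positive
cands = candidate_sentences(corpus, "teamScore")
# cands[0] yields the plausible seed pair ("Suns", "91"); cands[1]
# matches the pattern but does not express teamScore, which is why
# manual filtering follows.
```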
First, no pronouns were accepted. For teams, either the name or the city was acceptable. For people, a last name or a descriptive general noun like "defendant" was accepted. For organizations referred to by name, only the most descriptive portion of the name was retained, for example "Harvard" for "Harvard University".

To simulate the LABELOCCURRENCES function in Algorithm 1, a simple extractor randomly selected additional sentences in the corpus that contained a pair of seed instances. For each seed instance pair, it selected ten sentences to evaluate the precision of LABELOCCURRENCES within a single document cluster, called the "inside cluster" sentences. Since there might not be a single document cluster with ten occurrences of a seed instance pair, inside cluster sentences were added whenever there were at least two within a single cluster: the extractor continued randomly adding all matching sentences from suitable clusters until the number of sentences minus the number of document clusters they were drawn from equaled or exceeded ten. This method is shown formally in Algorithm 2. The function SENTENCESRELATING(si, c) finds all sentences in the document cluster c that relate the seed instance pair si.

Algorithm 2 Gather inside cluster sentences
Input:
  si: An instance pair.
  C: The parallel corpus, composed of document clusters c.
Output:
  ICSsi: The inside cluster sentences for the instance pair si. It is a set of sets of sentences relating the two instances; each contained set is drawn from the same document cluster.
Procedure:
  ICSsi ← ∅
  size ← 0
  for c ∈ C do
    S ← SENTENCESRELATING(si, c)
    if |S| ≥ 2 then
      ICSsi ← ICSsi ∪ {S}
      size ← size + |S| − 1
    end if
    if size ≥ 10 then
      return ICSsi
    end if
  end for

Note that although the for loop appears to iterate over the corpus in a fixed order, the document clusters were sampled randomly without replacement. The sentence gathering algorithm is run for each seed instance pair from each target relation.

We also selected another ten sentences spread across document clusters to test the precision of LABELOCCURRENCES without document clustering, called the "outside cluster" sentences. Though somewhat obvious, for completeness, this is shown in Algorithm 3.

Algorithm 3 Gather outside cluster sentences
Input:
  si: An instance pair.
  C: The parallel corpus, composed of document clusters c.
Output:
  OCSsi: The outside cluster sentences for the instance pair si. It is a set of sentences relating the two instances.
Procedure:
  OCSsi ← ∅
  for c ∈ C do
    S ← SENTENCESRELATING(si, c)
    if |S| ≥ 1 then
      OCSsi ← OCSsi ∪ {RANDOMELEMENT(S)}
    end if
    if |OCSsi| ≥ 10 then
      return OCSsi
    end if
  end for

We required the sentences found in this way to have a close syntactic relationship between the two seed instances. We measured this by parsing the sentences and rejecting any sentence where the words were separated by a dependency path of length greater than five.
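Algorithms 2 and 3 can be sketched as follows, under simplified assumptions: a corpus is a list of clusters, each cluster a list of sentences, and SENTENCESRELATING is approximated by substring containment. The dependency-path filter (path length at most five between the two instances) is omitted here, since it requires a parser.

```python
import random

def sentences_relating(pair, cluster):
    """Toy SENTENCESRELATING: substring containment instead of parsing."""
    return [s for s in cluster if all(w in s for w in pair)]

def inside_cluster(pair, corpus, target=10):
    """Algorithm 2: groups of co-occurring sentences, >= 2 per cluster."""
    groups, size = [], 0
    for cluster in random.sample(corpus, len(corpus)):  # random order, no replacement
        sents = sentences_relating(pair, cluster)
        if len(sents) >= 2:                  # cluster must contribute at least two
            groups.append(sents)
            size += len(sents) - 1           # sentences minus clusters drawn from
        if size >= target:
            break
    return groups

def outside_cluster(pair, corpus, target=10):
    """Algorithm 3: at most one randomly chosen sentence per cluster."""
    picked = []
    for cluster in random.sample(corpus, len(corpus)):
        sents = sentences_relating(pair, cluster)
        if sents:
            picked.append(random.choice(sents))
        if len(picked) >= target:
            break
    return picked

# Tiny made-up corpus: two clusters of sentences
clusters = [["a b x", "a b y"], ["a b z"]]
pair = ("a", "b")
groups = inside_cluster(pair, clusters)   # only the 2-sentence cluster qualifies
picked = outside_cluster(pair, clusters)  # one sentence from each cluster
```

On a corpus this small the loops simply exhaust the clusters without reaching the target of ten; on the real corpora they stop early once the quota is met.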
Publication date: 2011